These are the libraries used to explore, prepare, and analyze the data and to build our models:
library(tidyverse)
library(dplyr)
library(corrplot)
library(MASS)
library(dvmisc)
library(car)
library(lmtest)
library(olsrr)
library(caret)
library(kableExtra)
library(hrbrthemes)
| | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 13300000 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | yes | furnished |
| 2 | 12250000 | 8960 | 4 | 4 | 4 | yes | no | no | no | yes | 3 | no | furnished |
| 3 | 12250000 | 9960 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | yes | semi-furnished |
| 5 | 11410000 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | no | furnished |
| 8 | 10150000 | 16200 | 5 | 3 | 2 | yes | no | no | no | no | 0 | no | unfurnished |
| 9 | 9870000 | 8100 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | yes | furnished |
\(~\)
Based on this, our training data includes 386 records and 13 variables, whereas the evaluation data includes 159 records and 13 variables.
Training:
## 'data.frame': 386 obs. of 13 variables:
## $ price : int 13300000 12250000 12250000 11410000 10150000 9870000 9800000 9800000 9681000 9310000 ...
## $ area : int 7420 8960 9960 7420 16200 8100 5750 13200 6000 6550 ...
## $ bedrooms : int 4 4 3 4 5 4 3 3 4 4 ...
## $ bathrooms : int 2 4 2 1 3 1 2 1 3 2 ...
## $ stories : int 3 4 2 2 2 2 4 2 2 2 ...
## $ mainroad : chr "yes" "yes" "yes" "yes" ...
## $ guestroom : chr "no" "no" "no" "yes" ...
## $ basement : chr "no" "no" "yes" "yes" ...
## $ hotwaterheating : chr "no" "no" "no" "no" ...
## $ airconditioning : chr "yes" "yes" "no" "yes" ...
## $ parking : int 2 3 2 2 0 2 1 2 2 1 ...
## $ prefarea : chr "yes" "no" "yes" "no" ...
## $ furnishingstatus: chr "furnished" "furnished" "semi-furnished" "furnished" ...
\(~\)
Evaluation:
## 'data.frame': 159 obs. of 13 variables:
## $ price : int 12215000 10850000 10150000 9240000 9100000 8960000 8855000 8750000 8400000 8120000 ...
## $ area : int 7500 7500 8580 7800 6600 8500 6420 4320 7950 6840 ...
## $ bedrooms : int 4 3 4 3 4 3 3 3 5 5 ...
## $ bathrooms : int 2 3 3 2 2 2 2 1 2 1 ...
## $ stories : int 2 1 4 2 2 4 2 2 2 2 ...
## $ mainroad : chr "yes" "yes" "yes" "yes" ...
## $ guestroom : chr "no" "no" "no" "no" ...
## $ basement : chr "yes" "yes" "no" "no" ...
## $ hotwaterheating : chr "no" "no" "no" "no" ...
## $ airconditioning : chr "yes" "yes" "yes" "no" ...
## $ parking : int 3 2 2 0 1 2 1 2 2 1 ...
## $ prefarea : chr "yes" "yes" "yes" "yes" ...
## $ furnishingstatus: chr "furnished" "semi-furnished" "semi-furnished" "semi-furnished" ...
\(~\)
Let's start exploring the training and evaluation data using the summary() function.
Training:
## price area bedrooms bathrooms
## Min. : 1750000 Min. : 1650 Min. :1.000 Min. :1.00
## 1st Qu.: 3473750 1st Qu.: 3588 1st Qu.:2.000 1st Qu.:1.00
## Median : 4340000 Median : 4600 Median :3.000 Median :1.00
## Mean : 4763635 Mean : 5178 Mean :2.953 Mean :1.28
## 3rd Qu.: 5740000 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.00
## Max. :13300000 Max. :16200 Max. :6.000 Max. :4.00
## stories mainroad guestroom basement
## Min. :1.000 Length:386 Length:386 Length:386
## 1st Qu.:1.000 Class :character Class :character Class :character
## Median :2.000 Mode :character Mode :character Mode :character
## Mean :1.793
## 3rd Qu.:2.000
## Max. :4.000
## hotwaterheating airconditioning parking prefarea
## Length:386 Length:386 Min. :0.000 Length:386
## Class :character Class :character 1st Qu.:0.000 Class :character
## Mode :character Mode :character Median :0.000 Mode :character
## Mean :0.715
## 3rd Qu.:1.000
## Max. :3.000
## furnishingstatus
## Length:386
## Class :character
## Mode :character
##
##
##
\(~\)
Evaluation:
## price area bedrooms bathrooms
## Min. : 1767150 Min. : 1836 Min. :1.000 Min. :1.000
## 1st Qu.: 3430000 1st Qu.: 3600 1st Qu.:3.000 1st Qu.:1.000
## Median : 4270000 Median : 4500 Median :3.000 Median :1.000
## Mean : 4774240 Mean : 5083 Mean :2.994 Mean :1.302
## 3rd Qu.: 5771500 3rd Qu.: 6450 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :12215000 Max. :12944 Max. :5.000 Max. :3.000
## stories mainroad guestroom basement
## Min. :1.000 Length:159 Length:159 Length:159
## 1st Qu.:1.000 Class :character Class :character Class :character
## Median :2.000 Mode :character Mode :character Mode :character
## Mean :1.836
## 3rd Qu.:2.000
## Max. :4.000
## hotwaterheating airconditioning parking prefarea
## Length:159 Length:159 Min. :0.0000 Length:159
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :0.0000 Mode :character
## Mean :0.6415
## 3rd Qu.:1.0000
## Max. :3.0000
## furnishingstatus
## Length:159
## Class :character
## Mode :character
##
##
##
\(~\)
It is important to recognize that every home in this dataset is priced above 1 million (the minimum in the training data is 1,750,000). It is not clear whether this is a US dataset; if it is, these prices would indicate luxury homes and/or high-value markets.
\(~\)
The area variable appears to be the square footage of the home. We would traditionally expect increases in area to lead to increases in price.
\(~\)
While we expect price to rise with the number of bedrooms, we also expect diminishing returns: at some point an additional bedroom has less of an impact. For example, going from one to two bedrooms should increase the price significantly, while going from four to five should matter less.
## bedrooms n
## 1 1 1
## 2 2 102
## 3 3 207
## 4 4 68
## 5 5 6
## 6 6 2
Based on the distribution of the number of bedrooms, it may be best to categorize these with dummy variables: 2, 3, and 4+.
\(~\)
As with bedrooms, we would expect price to increase with the bathroom count, although having more than four bathrooms is likely to lead to smaller increases.
## bathrooms n
## 1 1 288
## 2 2 89
## 3 3 8
## 4 4 1
Based on the distribution of the number of bathrooms, it may be best to categorize these with dummy variables: 2 and 3+.
\(~\)
As with bedrooms and bathrooms, it would seem to make sense to group homes with 3 or more floors together by introducing dummy variables: 2 and 3+.
## stories n
## 1 1 169
## 2 2 161
## 3 3 23
## 4 4 33
\(~\)
We are assuming that the parking variable represents the size of a garage, in cars. As with the other variables, going from no garage to a one-car garage should increase the price significantly, while additional spaces would add less value. It would initially seem to make sense to introduce dummy variables: 1 and 2+.
## parking n
## 1 0 203
## 2 1 97
## 3 2 79
## 4 3 7
\(~\)
The furnishing status variable takes three values: unfurnished, semi-furnished, and furnished. Since we consider unfurnished the default state, we will use dummy variables for semi-furnished and furnished.
## furnishingstatus n
## 1 furnished 103
## 2 semi-furnished 160
## 3 unfurnished 123
\(~\)
The main road variable is a yes/no value indicating whether the home is on a main road. We will replace this with a dummy variable.
## mainroad n
## 1 no 50
## 2 yes 336
\(~\)
The guest room variable is a yes/no value indicating whether the home has a guest room. It is unclear from the dataset source whether this is in addition to the number of bedrooms, but we would expect houses with a guest room to have a higher price. We will replace this with a dummy variable.
## guestroom n
## 1 no 312
## 2 yes 74
\(~\)
The basement variable is a yes/no value indicating whether the home has a basement. It is unclear whether having a basement would increase the home price, but we will replace this with a dummy variable for analysis.
## basement n
## 1 no 249
## 2 yes 137
\(~\)
Based on the distribution, we assume that the hot water heating variable indicates whether the house has in-floor heating rather than forced air. Under this assumption, we expect that having this feature would lead to a higher house price. The variable will be replaced with a dummy variable for analysis.
## hotwaterheating n
## 1 no 366
## 2 yes 20
\(~\)
The air conditioning variable indicates whether the house has central air conditioning. We would expect homes with air conditioning to have a higher price than those without. The variable will be replaced with a dummy variable.
## airconditioning n
## 1 no 264
## 2 yes 122
\(~\)
The dataset source doesn’t specify exactly what the prefarea variable represents. We are assuming that it is a yes/no value indicating whether the house is in a preferred neighborhood. We would expect houses with a yes to have a higher price than those without.
## prefarea n
## 1 no 298
## 2 yes 88
\(~\)
Based on our exploration, we do not have any blank values in our dataset.
We will introduce a clean function to replace our categorical variables with dummy variables. This will also ensure that our training and evaluation datasets are processed in the same way.
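The report does not show the clean function itself, so the following is only a minimal sketch of what such a function might look like. The function name `clean_housing()` is hypothetical; the output column names and their order are taken from the model summaries and the outlier table in this report, and the baseline levels (1 bedroom, 1 bathroom, 1 story, 0 parking, unfurnished) are inferred from which dummies appear there.

```r
# Hypothetical helper: the name clean_housing() is an assumption; column names
# follow the model output in this report.
clean_housing <- function(df) {
  yes_no <- function(x) as.integer(x == "yes")   # "yes"/"no" text -> 1/0

  data.frame(
    price           = df$price,
    area            = df$area,
    mainroad        = yes_no(df$mainroad),
    guestroom       = yes_no(df$guestroom),
    basement        = yes_no(df$basement),
    hotwaterheating = yes_no(df$hotwaterheating),
    # Bedrooms: 1 is the baseline; 6 or more share one indicator
    bed2            = as.integer(df$bedrooms == 2),
    bed3            = as.integer(df$bedrooms == 3),
    bed4            = as.integer(df$bedrooms == 4),
    bed5            = as.integer(df$bedrooms == 5),
    bed6plus        = as.integer(df$bedrooms >= 6),
    # Bathrooms: 1 is the baseline
    bath2           = as.integer(df$bathrooms == 2),
    bath3           = as.integer(df$bathrooms == 3),
    bath4plus       = as.integer(df$bathrooms >= 4),
    # Stories: 1 is the baseline
    floor2          = as.integer(df$stories == 2),
    floor3          = as.integer(df$stories == 3),
    floor4plus      = as.integer(df$stories >= 4),
    # Parking: 0 is the baseline
    car1            = as.integer(df$parking == 1),
    car2            = as.integer(df$parking == 2),
    car3plus        = as.integer(df$parking >= 3),
    # Furnishing: unfurnished is the baseline
    semifurnished   = as.integer(df$furnishingstatus == "semi-furnished"),
    furnished       = as.integer(df$furnishingstatus == "furnished"),
    ac              = yes_no(df$airconditioning),
    neighborhood    = yes_no(df$prefarea)
  )
}

# The first two rows of the raw data shown earlier in this report
sample_rows <- data.frame(
  price = c(13300000, 12250000), area = c(7420, 8960),
  bedrooms = c(4, 4), bathrooms = c(2, 4), stories = c(3, 4),
  mainroad = c("yes", "yes"), guestroom = c("no", "no"),
  basement = c("no", "no"), hotwaterheating = c("no", "no"),
  airconditioning = c("yes", "yes"), parking = c(2, 3),
  prefarea = c("yes", "no"),
  furnishingstatus = c("furnished", "furnished")
)

cleaned <- clean_housing(sample_rows)
print(cleaned)
```

Applying the same function to both the training and the evaluation data is what keeps the two encodings identical.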
\(~\)
Visual evaluation:
\(~\)
After cleaning the dataset, a correlation plot lets us check our initial expectations for the variables.
The correlation plot generally confirms those expectations.
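The plotting code is not shown in this report. As a rough sketch of the idea, the snippet below computes a correlation matrix with base R and ranks the predictors by their correlation with price. The data frame `demo` is simulated and its values are made up; only the column names echo this report. The `corrplot::corrplot()` call is guarded so the snippet still runs when that package is not installed.

```r
# Simulated stand-in for the cleaned training data (values are made up)
set.seed(1)
n <- 200
demo <- data.frame(
  area = runif(n, 1650, 16200),
  ac   = rbinom(n, 1, 0.3)
)
demo$price <- 1.5e6 + 260 * demo$area + 7.6e5 * demo$ac + rnorm(n, sd = 1e6)

# Pearson correlation matrix of all columns
cor_mat <- cor(demo)

# Correlation of each predictor with price, strongest first
price_cor <- sort(cor_mat["price", colnames(cor_mat) != "price"],
                  decreasing = TRUE)
print(round(price_cor, 2))

# Draw the plot only if corrplot is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "circle", type = "upper")
}
```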
\(~\)
\(~\)
##
## Call:
## lm(formula = price ~ ., data = model_lin_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2778242 -618889 -69359 502507 5058478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1328656.61 1089042.62 1.220 0.223251
## area 259.46 29.19 8.889 < 0.0000000000000002 ***
## mainroad 392612.63 177649.27 2.210 0.027727 *
## guestroom 387128.55 154049.10 2.513 0.012404 *
## basement 340379.30 132393.80 2.571 0.010540 *
## hotwaterheating 454981.22 254142.03 1.790 0.074247 .
## bed2 -135521.62 1079765.76 -0.126 0.900189
## bed3 84970.58 1084129.57 0.078 0.937572
## bed4 294373.83 1095943.45 0.269 0.788388
## bed5 349023.44 1186154.54 0.294 0.768737
## bed6plus 822411.56 1320048.36 0.623 0.533666
## bath2 823033.31 148991.77 5.524 0.0000000634 ***
## bath3 1711486.52 412100.64 4.153 0.0000409553 ***
## bath4plus 5939390.79 1173229.90 5.062 0.0000006603 ***
## floor2 369286.09 145200.74 2.543 0.011397 *
## floor3 917915.07 262415.57 3.498 0.000527 ***
## floor4plus 1368891.57 247650.70 5.528 0.0000000623 ***
## car1 350724.78 139562.29 2.513 0.012404 *
## car2 597602.32 154287.27 3.873 0.000127 ***
## car3plus -694646.46 454023.80 -1.530 0.126896
## semifurnished 386745.44 133594.69 2.895 0.004023 **
## furnished 533608.97 151214.70 3.529 0.000471 ***
## ac 762389.97 133720.79 5.701 0.0000000247 ***
## neighborhood 666169.08 141563.01 4.706 0.0000036049 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1066000 on 362 degrees of freedom
## Multiple R-squared: 0.6949, Adjusted R-squared: 0.6755
## F-statistic: 35.85 on 23 and 362 DF, p-value: < 0.00000000000000022
\(~\)
##
## Call:
## lm(formula = price ~ area + mainroad + guestroom + basement +
## hotwaterheating + bed2 + bath2 + bath3 + bath4plus + floor2 +
## floor3 + floor4plus + car1 + car2 + car3plus + semifurnished +
## furnished + ac + neighborhood, data = model_lin_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2606156 -623517 -76743 477682 5170369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1427060.25 221172.14 6.452 0.00000000035 ***
## area 262.55 28.72 9.142 < 0.0000000000000002 ***
## mainroad 371018.49 174144.11 2.131 0.033795 *
## guestroom 388933.37 153592.89 2.532 0.011751 *
## basement 327059.67 131426.10 2.489 0.013271 *
## hotwaterheating 440059.27 252885.91 1.740 0.082673 .
## bed2 -242263.06 151828.03 -1.596 0.111432
## bath2 871714.49 145028.40 6.011 0.00000000447 ***
## bath3 1782594.89 398633.12 4.472 0.00001036418 ***
## bath4plus 6100918.97 1164426.07 5.239 0.00000027257 ***
## floor2 427081.88 139028.59 3.072 0.002286 **
## floor3 944710.16 259836.19 3.636 0.000317 ***
## floor4plus 1394855.82 245550.02 5.681 0.00000002741 ***
## car1 354677.24 138326.44 2.564 0.010744 *
## car2 606621.41 153181.86 3.960 0.00009002748 ***
## car3plus -694809.66 452826.31 -1.534 0.125799
## semifurnished 390379.23 132434.98 2.948 0.003407 **
## furnished 540998.79 150081.46 3.605 0.000356 ***
## ac 757477.86 133121.53 5.690 0.00000002604 ***
## neighborhood 661882.68 141156.08 4.689 0.00000388102 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1064000 on 366 degrees of freedom
## Multiple R-squared: 0.6927, Adjusted R-squared: 0.6767
## F-statistic: 43.42 on 19 and 366 DF, p-value: < 0.00000000000000022
\(~\)
##
## Call:
## lm(formula = price ~ area + guestroom + basement + bath2 + bath3 +
## bath4plus + floor2 + floor3 + floor4plus + car1 + car2 +
## car3plus + semifurnished + furnished + ac + neighborhood -
## car3plus - bed2, data = model_lin_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2649096 -673238 -43507 477530 5001125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1503879.37 170520.37 8.819 < 0.0000000000000002 ***
## area 270.76 27.87 9.715 < 0.0000000000000002 ***
## guestroom 408948.95 154956.82 2.639 0.008664 **
## basement 355323.88 131895.11 2.694 0.007382 **
## bath2 890196.07 145115.53 6.134 0.00000000220 ***
## bath3 1780896.41 401261.08 4.438 0.00001198413 ***
## bath4plus 5494328.88 1105325.41 4.971 0.00000102187 ***
## floor2 550003.46 123318.63 4.460 0.00001088514 ***
## floor3 1161098.75 249130.18 4.661 0.00000440677 ***
## floor4plus 1495717.83 237291.27 6.303 0.00000000083 ***
## car1 395580.85 136995.37 2.888 0.004111 **
## car2 706158.82 151485.48 4.662 0.00000438759 ***
## semifurnished 421633.75 132631.94 3.179 0.001602 **
## furnished 569389.00 150773.07 3.776 0.000185 ***
## ac 760707.49 132467.98 5.743 0.00000001947 ***
## neighborhood 698432.62 139650.80 5.001 0.00000088145 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1074000 on 370 degrees of freedom
## Multiple R-squared: 0.6831, Adjusted R-squared: 0.6703
## F-statistic: 53.18 on 15 and 370 DF, p-value: < 0.00000000000000022
\(~\)
Verify linear modeling assumptions:
## [1] "--------------------------------------------------"
## lm(formula = price ~ area + mainroad + guestroom + basement +
## hotwaterheating + bed2 + bath2 + bath3 + bath4plus + floor2 +
## floor3 + floor4plus + car1 + car2 + car3plus + semifurnished +
## furnished + ac + neighborhood, data = model_lin_train)
## [1] ""
## [1] "Shapiro test for normality: The p-value of 0.0000000000102740426604147 is <= 0.05, so reject the null; i.e., the residuals are NOT NORMAL"
## [1] ""
## [1] "Breusch-Pagan test for homoschedasticity: The p-value of 0.0000153934723145733 is <= 0.05 and the test statistic is 56.1639472213671, so reject the null; i.e., the residuals are HETEROSCHEDASTIC."
## [1] ""
## [1] "Variance inflation factor (VIF)"
## [1] "<=1: not correlated, 1-5: moderately correlated, >5: strongly correlated"
## floor4plus floor2 bed2 furnished semifurnished
## 1.607734 1.602737 1.528509 1.502877 1.451709
## area basement ac car2 floor3
## 1.372255 1.348742 1.306490 1.302644 1.290267
## bath2 guestroom car3plus car1 neighborhood
## 1.272619 1.246735 1.245221 1.227795 1.196035
## bath4plus mainroad bath3 hotwaterheating
## 1.194896 1.166199 1.099953 1.071534
## [1] ""
## [1] "Model scores:"
## [1] " adjusted R-squared: 0.677"
## [1] " AIC: 11830.246"
## [1] " BIC: 11913.319"
## [1] " Mallow's Cp: 20"
## [1] " mean squared error: 1073150824664.42"
## [1] ""
## [1] "Leverage point cutoff: 0.10880829015544"
## [1] ""
## [1] "First 10 points of influence:"
## [1] " case #2: 1"
## [1] " case #5: 0.219"
## [1] " case #9: 0.184"
## [1] " case #25: 0.155"
## [1] " case #33: 0.209"
## [1] " case #49: 0.128"
## [1] " case #62: 0.149"
## [1] " case #103: 0.133"
## [1] " case #110: 0.15"
## [1] " case #136: 0.149"
## [1] " case #156: 0.201"
## [1] ""
## [1] "--------------------------------------------------"
## lm(formula = price ~ area + guestroom + basement + bath2 + bath3 +
## bath4plus + floor2 + floor3 + floor4plus + car1 + car2 +
## car3plus + semifurnished + furnished + ac + neighborhood -
## car3plus - bed2, data = model_lin_train)
## [1] ""
## [1] "Shapiro test for normality: The p-value of 0.0000000000133366117207675 is <= 0.05, so reject the null; i.e., the residuals are NOT NORMAL"
## [1] ""
## [1] "Breusch-Pagan test for homoschedasticity: The p-value of 0.000000924975310233364 is <= 0.05 and the test statistic is 56.6934694356472, so reject the null; i.e., the residuals are HETEROSCHEDASTIC."
## [1] ""
## [1] "Variance inflation factor (VIF)"
## [1] "<=1: not correlated, 1-5: moderately correlated, >5: strongly correlated"
## furnished floor4plus semifurnished basement ac
## 1.487044 1.471990 1.427504 1.331772 1.268347
## area bath2 car2 guestroom floor2
## 1.267011 1.249185 1.248993 1.244114 1.236285
## car1 floor3 neighborhood bath3 bath4plus
## 1.180685 1.162893 1.147727 1.092669 1.055586
## [1] ""
## [1] "Model scores:"
## [1] " adjusted R-squared: 0.67"
## [1] " AIC: 11834.079"
## [1] " BIC: 11901.328"
## [1] " Mallow's Cp: 16"
## [1] " mean squared error: 1106558754047.73"
## [1] ""
## [1] "Leverage point cutoff: 0.0880829015544041"
## [1] ""
## [1] "First 10 points of influence:"
## [1] " case #2: 1"
## [1] " case #5: 0.215"
## [1] " case #9: 0.143"
## [1] " case #25: 0.15"
## [1] " case #62: 0.148"
## [1] " case #110: 0.146"
## [1] " case #136: 0.146"
## [1] " case #210: 0.144"
## [1] " case #232: 0.091"
## [1] " case #355: 0.15"
## [1] ""
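The helper that printed the diagnostics above is not shown in this report. The following is a minimal base R sketch of the same kinds of checks, not the authors' function: `check_lm()` is a hypothetical name, the Breusch-Pagan statistic is the studentized version computed by hand rather than through `lmtest`, and the VIF is computed by hand rather than through `car`. It assumes an unweighted model with an intercept in the first column of the design matrix. The leverage cutoff `2 * (p + 1) / n`, with `p` counting all coefficients including the intercept, reproduces the cutoffs printed above (for example 2 * 21 / 386 for the 20-coefficient model).

```r
# Hypothetical diagnostic helper for an unweighted lm with an intercept
check_lm <- function(model) {
  X <- model.matrix(model)
  n <- nrow(X)
  p <- ncol(X)                 # number of coefficients, intercept included
  e <- residuals(model)

  # Shapiro-Wilk: the null hypothesis is that the residuals are normal
  shapiro_p <- shapiro.test(e)$p.value

  # Studentized Breusch-Pagan: regress squared residuals on the predictors;
  # the statistic is n * R^2, chi-squared with p - 1 degrees of freedom
  e2 <- e^2
  aux <- lm.fit(X, e2)
  r2 <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  bp_stat <- n * r2
  bp_p <- pchisq(bp_stat, df = p - 1, lower.tail = FALSE)

  # VIF for predictor j is 1 / (1 - R_j^2), i.e. TSS / RSS from regressing
  # column j on all the other columns (intercept included)
  vif <- sapply(colnames(X)[-1], function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    rss <- sum(lm.fit(others, X[, j])$residuals^2)
    tss <- sum((X[, j] - mean(X[, j]))^2)
    tss / rss
  })

  # Leverage: flag hat values above the cutoff
  h <- hatvalues(model)
  cutoff <- 2 * (p + 1) / n

  list(shapiro_p = shapiro_p, bp_stat = bp_stat, bp_p = bp_p,
       vif = sort(vif, decreasing = TRUE),
       leverage_cutoff = cutoff, high_leverage = h[h > cutoff])
}

# Simulated example in which the noise grows with x1 (heteroscedastic)
set.seed(42)
n <- 300
demo <- data.frame(x1 = runif(n, 1, 10), x2 = rbinom(n, 1, 0.4))
demo$y <- 2 + 3 * demo$x1 + 1.5 * demo$x2 + rnorm(n, sd = demo$x1)

fit <- lm(y ~ x1 + x2, data = demo)
res <- check_lm(fit)
print(res$bp_p)
print(res$vif)
```

A small Breusch-Pagan p-value rejects constant variance; a small Shapiro-Wilk p-value rejects normality of the residuals.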
\(~\)
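Because the residuals are non-normal and heteroscedastic, we try transforming the response, using Box-Cox to estimate what kind of transform is appropriate. The estimated parameter (about 0.08) is close to 0, which points to a log transform of price.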
## Estimated transformation parameter
## Y1
## 0.08222126
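The call that produced this estimate is not shown. As an illustration of what Box-Cox estimates, the sketch below maximizes the Box-Cox profile log-likelihood over lambda in base R. The function name `boxcox_loglik()` and the simulated data are assumptions for illustration only; the data are generated so that the log scale is the correct one, so the estimate should land near 0.

```r
# Profile log-likelihood of the Box-Cox parameter lambda for a linear model.
# boxcox_loglik() is a hypothetical name used only for this sketch.
boxcox_loglik <- function(lambda, formula, data) {
  mf <- model.frame(formula, data)
  y <- model.response(mf)
  X <- model.matrix(attr(mf, "terms"), mf)
  # Box-Cox transform; lambda = 0 is the log limit
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum(lm.fit(X, z)$residuals^2)
  n <- length(y)
  # Gaussian log-likelihood (up to a constant) plus the Jacobian term
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

# Simulated prices that are linear in area on the log scale (values made up)
set.seed(7)
n <- 2000
demo <- data.frame(area = runif(n, 1650, 16200))
demo$price <- exp(14.5 + 5e-5 * demo$area + rnorm(n, sd = 0.3))

opt <- optimize(boxcox_loglik, interval = c(-1, 1),
                formula = price ~ area, data = demo, maximum = TRUE)
lambda_hat <- opt$maximum
print(lambda_hat)
```

A lambda near 0 corresponds to `log(price)`, a lambda near 1 to leaving price untransformed, and a lambda near 0.5 to a square root.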
##
## Call:
## lm(formula = log(price) ~ ., data = model_lin_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58768 -0.12578 -0.00157 0.13060 0.60404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.544518392 0.212111659 68.570 < 0.0000000000000002 ***
## area 0.000050911 0.000005685 8.956 < 0.0000000000000002 ***
## mainroad 0.106432736 0.034600557 3.076 0.002257 **
## guestroom 0.072860681 0.030003978 2.428 0.015653 *
## basement 0.092487862 0.025786197 3.587 0.000381 ***
## hotwaterheating 0.085236067 0.049498970 1.722 0.085928 .
## bed2 -0.017549893 0.210304814 -0.083 0.933540
## bed3 0.058477567 0.211154749 0.277 0.781983
## bed4 0.083540318 0.213455726 0.391 0.695754
## bed5 0.091411039 0.231026043 0.396 0.692578
## bed6plus 0.256317014 0.257104398 0.997 0.319461
## bath2 0.150431917 0.029018967 5.184 0.00000036205 ***
## bath3 0.302837566 0.080264398 3.773 0.000188 ***
## bath4plus 0.684383498 0.228508725 2.995 0.002933 **
## floor2 0.057366590 0.028280592 2.028 0.043243 *
## floor3 0.196068831 0.051110399 3.836 0.000147 ***
## floor4plus 0.264755508 0.048234660 5.489 0.00000007621 ***
## car1 0.071065993 0.027182396 2.614 0.009311 **
## car2 0.092977637 0.030050365 3.094 0.002128 **
## car3plus -0.109608169 0.088429727 -1.239 0.215965
## semifurnished 0.144544389 0.026020095 5.555 0.00000005383 ***
## furnished 0.136319806 0.029451924 4.629 0.00000513855 ***
## ac 0.154534113 0.026044654 5.933 0.00000000694 ***
## neighborhood 0.127050640 0.027572075 4.608 0.00000564357 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2076 on 362 degrees of freedom
## Multiple R-squared: 0.7091, Adjusted R-squared: 0.6906
## F-statistic: 38.37 on 23 and 362 DF, p-value: < 0.00000000000000022
##
## Call:
## lm(formula = log(price) ~ area + mainroad + guestroom + basement +
## hotwaterheating + bed3 + bed4 + bed6plus + bath2 + bath3 +
## bath4plus + floor2 + floor3 + floor4plus + car1 + car2 +
## semifurnished + furnished + ac + neighborhood, data = model_lin_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58509 -0.12700 0.00047 0.12936 0.60034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.535084196 0.040152865 361.994 < 0.0000000000000002 ***
## area 0.000050982 0.000005498 9.273 < 0.0000000000000002 ***
## mainroad 0.098425270 0.033966231 2.898 0.003985 **
## guestroom 0.073094605 0.029965558 2.439 0.015192 *
## basement 0.094260108 0.025696821 3.668 0.000281 ***
## hotwaterheating 0.088006202 0.049413883 1.781 0.075744 .
## bed3 0.064144656 0.028343442 2.263 0.024215 *
## bed4 0.085823272 0.039599484 2.167 0.030859 *
## bed6plus 0.263077810 0.150898238 1.743 0.082104 .
## bath2 0.153013605 0.028971974 5.281 0.00000022073 ***
## bath3 0.326495664 0.077959203 4.188 0.00003531575 ***
## bath4plus 0.586803921 0.215060901 2.729 0.006669 **
## floor2 0.066285069 0.027306452 2.427 0.015688 *
## floor3 0.208868710 0.050236039 4.158 0.00004009382 ***
## floor4plus 0.265805557 0.047723404 5.570 0.00000004958 ***
## car1 0.071743162 0.026735578 2.683 0.007619 **
## car2 0.097614195 0.029751245 3.281 0.001134 **
## semifurnished 0.142899486 0.025928567 5.511 0.00000006745 ***
## furnished 0.136023794 0.029314267 4.640 0.00000485991 ***
## ac 0.156272867 0.026009414 6.008 0.00000000454 ***
## neighborhood 0.128037921 0.027532296 4.650 0.00000463690 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2076 on 365 degrees of freedom
## Multiple R-squared: 0.7068, Adjusted R-squared: 0.6907
## F-statistic: 43.99 on 20 and 365 DF, p-value: < 0.00000000000000022
## [1] "--------------------------------------------------"
## lm(formula = log(price) ~ area + mainroad + guestroom + basement +
## hotwaterheating + bed3 + bed4 + bed6plus + bath2 + bath3 +
## bath4plus + floor2 + floor3 + floor4plus + car1 + car2 +
## semifurnished + furnished + ac + neighborhood, data = model_lin_train)
## Warning: not plotting observations with leverage one:
## 2
## [1] ""
## [1] "Shapiro test for normality: The p-value of 0.0274288341817099 is <= 0.05, so reject the null; i.e., the residuals are NOT NORMAL"
## [1] ""
## [1] "Breusch-Pagan test for homoschedasticity: The p-value of 0.221006362177854 and test statistic of 24.5055362047153 are inconclusive, so homoschedasticity can't be determined using this test. But since the p-value is > 0.05, it is reasonable to conclude that the residuals are HOMOSCHEDASTIC."
## [1] ""
## [1] "Variance inflation factor (VIF)"
## [1] "<=1: not correlated, 1-5: moderately correlated, >5: strongly correlated"
## bed4 bed3 floor2 floor4plus furnished
## 2.039063 1.789964 1.624256 1.595391 1.506251
## semifurnished basement bath2 area ac
## 1.461844 1.354548 1.334192 1.321166 1.310209
## car2 floor3 guestroom car1 neighborhood
## 1.290896 1.267012 1.246656 1.204937 1.195362
## mainroad bath3 hotwaterheating bath4plus bed6plus
## 1.165520 1.105178 1.074793 1.070778 1.051587
## [1] ""
## [1] "Model scores:"
## [1] " adjusted R-squared: 0.691"
## [1] " AIC: -96.006"
## [1] " BIC: -8.978"
## [1] " Mallow's Cp: 21"
## [1] " mean squared error: 0.041"
## [1] ""
## [1] "Leverage point cutoff: 0.113989637305699"
## [1] ""
## [1] "First 10 points of influence:"
## [1] " case #2: 1"
## [1] " case #5: 0.232"
## [1] " case #9: 0.193"
## [1] " case #25: 0.159"
## [1] " case #49: 0.124"
## [1] " case #62: 0.163"
## [1] " case #77: 0.515"
## [1] " case #103: 0.153"
## [1] " case #110: 0.155"
## [1] " case #136: 0.157"
## [1] " case #210: 0.165"
## [1] ""
\(~\)
Investigate outliers.

| | price | area | mainroad | guestroom | basement | hotwaterheating | bed2 | bed3 | bed4 | bed5 | bed6plus | bath2 | bath3 | bath4plus | floor2 | floor3 | floor4plus | car1 | car2 | car3plus | semifurnished | furnished | ac | neighborhood |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 12250000 | 8960 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 8 | 10150000 | 16200 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 9681000 | 6000 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 34 | 8190000 | 5960 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 67 | 6930000 | 13200 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 90 | 6440000 | 8580 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 113 | 6083000 | 4300 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 144 | 5600000 | 4800 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 154 | 5530000 | 3300 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 196 | 4970000 | 4410 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 291 | 4200000 | 2610 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
\(~\)
Investigation of the outliers reveals no obvious pattern, so we have to assume there is some other variable at play that we don’t have data for (e.g. high-end appliances, presence of a pool, property condition, etc.). We’ll remove the outliers and re-run the model.
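A minimal sketch of dropping high-leverage rows and refitting, on simulated data (the values are made up; only the column names echo this report). It also illustrates a side effect to watch for: the training data contains a single home with four bathrooms, and it is among the flagged rows, so once it is removed the `bath4plus` column is constant and its coefficient can no longer be estimated. R reports such a coefficient as `NA`.

```r
# Simulated data in which exactly one home has the rare feature
set.seed(3)
n <- 60
demo <- data.frame(area = runif(n, 2000, 9000), bath4plus = 0L)
demo$bath4plus[2] <- 1L
demo$logprice <- 14.5 + 5e-5 * demo$area + 0.6 * demo$bath4plus +
  rnorm(n, sd = 0.2)

fit_all <- lm(logprice ~ area + bath4plus, data = demo)

# Flag rows whose leverage exceeds 2 * (p + 1) / n, with p the number of
# coefficients including the intercept
cutoff <- 2 * (length(coef(fit_all)) + 1) / n
drop_rows <- which(hatvalues(fit_all) > cutoff)

# Keep the remaining rows; setdiff() avoids the pitfall that indexing with
# an empty negative vector would return zero rows
keep <- setdiff(seq_len(nrow(demo)), drop_rows)
fit_trim <- lm(formula(fit_all), data = demo[keep, ])

# bath4plus is now all zeros, so its coefficient is NA
print(coef(fit_trim))
```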
##
## Call:
## lm(formula = formula(lm_mod5), data = model_lin_train[c(-2, -5,
## -9, -25, -49, -62, -77, -103, -110, -136, -210), ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58650 -0.12611 -0.00086 0.12165 0.62001
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.552077340 0.041028854 354.679 < 0.0000000000000002 ***
## area 0.000046391 0.000005973 7.767 0.0000000000000873 ***
## mainroad 0.094132064 0.034584153 2.722 0.006812 **
## guestroom 0.078800876 0.030481181 2.585 0.010130 *
## basement 0.085255655 0.026311384 3.240 0.001307 **
## hotwaterheating 0.057759412 0.052990633 1.090 0.276456
## bed3 0.078248863 0.029177990 2.682 0.007665 **
## bed4 0.102622190 0.040833608 2.513 0.012407 *
## bed6plus 0.120709432 0.210035997 0.575 0.565853
## bath2 0.147071512 0.029194002 5.038 0.0000007515098156 ***
## bath3 -0.142253973 0.212242318 -0.670 0.503139
## bath4plus NA NA NA NA
## floor2 0.058563926 0.027551773 2.126 0.034227 *
## floor3 0.191284890 0.051435131 3.719 0.000232 ***
## floor4plus 0.260516774 0.047628306 5.470 0.0000000851899506 ***
## car1 0.075782849 0.027081148 2.798 0.005417 **
## car2 0.104147967 0.030074398 3.463 0.000599 ***
## semifurnished 0.150000199 0.026299921 5.703 0.0000000247524034 ***
## furnished 0.140301991 0.029667427 4.729 0.0000032579008657 ***
## ac 0.159133762 0.025897699 6.145 0.0000000021516006 ***
## neighborhood 0.132549803 0.027450937 4.829 0.0000020472915666 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2063 on 355 degrees of freedom
## Multiple R-squared: 0.6997, Adjusted R-squared: 0.6836
## F-statistic: 43.53 on 19 and 355 DF, p-value: < 0.00000000000000022
##
## Call:
## lm(formula = log(price) ~ area + mainroad + guestroom + basement +
## bed3 + bed4 + bath2 + floor2 + floor3 + floor4plus + car1 +
## car2 + semifurnished + furnished + ac + neighborhood, data = model_lin_train[c(-2,
## -5, -9, -25, -49, -62, -77, -103, -110, -136, -210), ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58873 -0.12342 0.00127 0.12844 0.67130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.552303788 0.040688236 357.654 < 0.0000000000000002 ***
## area 0.000045884 0.000005951 7.710 0.000000000000125 ***
## mainroad 0.097414780 0.034245452 2.845 0.004702 **
## guestroom 0.076896113 0.030070499 2.557 0.010964 *
## basement 0.085065566 0.026142098 3.254 0.001246 **
## bed3 0.078025035 0.028949410 2.695 0.007366 **
## bed4 0.100743111 0.040624077 2.480 0.013602 *
## bath2 0.149409677 0.029079309 5.138 0.000000457356246 ***
## floor2 0.059460729 0.027314842 2.177 0.030143 *
## floor3 0.195417664 0.051220442 3.815 0.000160 ***
## floor4plus 0.259324130 0.047494588 5.460 0.000000089151967 ***
## car1 0.081547827 0.026701058 3.054 0.002426 **
## car2 0.106210924 0.029971879 3.544 0.000447 ***
## semifurnished 0.152667330 0.026031467 5.865 0.000000010235661 ***
## furnished 0.141615566 0.029508058 4.799 0.000002343150712 ***
## ac 0.155994408 0.025661908 6.079 0.000000003102744 ***
## neighborhood 0.130735463 0.027358190 4.779 0.000002579879496 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.206 on 358 degrees of freedom
## Multiple R-squared: 0.698, Adjusted R-squared: 0.6845
## F-statistic: 51.71 on 16 and 358 DF, p-value: < 0.00000000000000022
## [1] "--------------------------------------------------"
## lm(formula = log(price) ~ area + mainroad + guestroom + basement +
## bed3 + bed4 + bath2 + floor2 + floor3 + floor4plus + car1 +
## car2 + semifurnished + furnished + ac + neighborhood, data = model_lin_train[c(-2,
## -5, -9, -25, -49, -62, -77, -103, -110, -136, -210), ])
## [1] ""
## [1] "Shapiro test for normality: The p-value of 0.0135819482901162 is <= 0.05, so reject the null; i.e., the residuals are NOT NORMAL"
## [1] ""
## [1] "Breusch-Pagan test for homoschedasticity: The p-value of 0.0840661091923266 and test statistic of 24.2557228274724 are inconclusive, so homoschedasticity can't be determined using this test. But since the p-value is > 0.05, it is reasonable to conclude that the residuals are HOMOSCHEDASTIC."
## [1] ""
## [1] "Variance inflation factor (VIF)"
## [1] "<=1: not correlated, 1-5: moderately correlated, >5: strongly correlated"
## bed4 bed3 floor2 floor4plus furnished
## 2.063505 1.834765 1.591989 1.555359 1.494596
## semifurnished area basement bath2 car2
## 1.454334 1.382581 1.372272 1.331000 1.282363
## floor3 ac guestroom car1 neighborhood
## 1.279918 1.271438 1.239238 1.191343 1.187511
## mainroad
## 1.156360
## [1] ""
## [1] "Model scores:"
## [1] " adjusted R-squared: 0.684"
## [1] " AIC: -101.985"
## [1] " BIC: -31.301"
## [1] " Mallow's Cp: 17"
## [1] " mean squared error: 0.041"
## [1] ""
## [1] "Leverage point cutoff: 0.096"
## [1] ""
## [1] "First 10 points of influence:"
## [1] " case #48: 0.104"
## [1] " case #62: 0.102"
## [1] " case #80: 0.099"
## [1] " case #84: 0.098"
## [1] " case #221: 0.116"
## [1] ""
\(~\)
The residuals are still not normally distributed, so we use robust regression to try to address the non-normality.
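The call in the output that follows passes `weights = lm_mod8$w` to `lm()`, which suggests that the weights come from a robust fit; how `lm_mod8` was created is not shown, so the sketch below is an assumption about the general pattern rather than the report's code. It fits a Huber M-estimator with `MASS::rlm()` on simulated data (values made up), then refits an ordinary `lm()` using the robust fit's final weights so that the usual `summary()` output is available.

```r
# Simulated log-price data with a few gross outliers (values are made up)
set.seed(11)
n <- 200
demo <- data.frame(area = runif(n, 1650, 16200))
demo$logprice <- 14.5 + 5e-5 * demo$area + rnorm(n, sd = 0.2)
demo$logprice[1:5] <- demo$logprice[1:5] + 1.5

# Robust fit: rlm() uses a Huber M-estimator by default and stores the final
# iteratively reweighted least squares weights in $w
rob_fit <- MASS::rlm(logprice ~ area, data = demo)

# Weighted least squares with those weights
wls_fit <- lm(logprice ~ area, data = demo, weights = rob_fit$w)

# Observations with large residuals are down-weighted
print(round(rob_fit$w[1:5], 3))
print(coef(wls_fit))
```

Note that the weighted `lm()` treats the weights as fixed and known, so its standard errors and p-values do not account for the weights having been estimated from the same data.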
##
## Call:
## lm(formula = formula(lm_mod7), data = model_lin_train, weights = lm_mod8$w)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -0.39873 -0.13163 0.00627 0.12850 0.41563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.576837195 0.036640378 397.835 < 0.0000000000000002 ***
## area 0.000049910 0.000005072 9.840 < 0.0000000000000002 ***
## mainroad 0.084162100 0.030798460 2.733 0.006585 **
## guestroom 0.066825026 0.027117946 2.464 0.014186 *
## basement 0.092910569 0.023534417 3.948 0.00009443484523 ***
## bed3 0.047228877 0.025581852 1.846 0.065666 .
## bed4 0.071650779 0.035800323 2.001 0.046081 *
## bath2 0.145767340 0.026146814 5.575 0.00000004789330 ***
## floor2 0.087224376 0.024366301 3.580 0.000390 ***
## floor3 0.228322844 0.045009601 5.073 0.00000062234166 ***
## floor4plus 0.304215596 0.042847379 7.100 0.00000000000645 ***
## car1 0.061013969 0.024105545 2.531 0.011785 *
## car2 0.093803011 0.027068547 3.465 0.000592 ***
## semifurnished 0.136384089 0.023452238 5.815 0.00000001311493 ***
## furnished 0.139402409 0.026699824 5.221 0.00000029770305 ***
## ac 0.142091507 0.023231442 6.116 0.00000000244205 ***
## neighborhood 0.127632227 0.024725726 5.162 0.00000040033935 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1835 on 369 degrees of freedom
## Multiple R-squared: 0.7219, Adjusted R-squared: 0.7098
## F-statistic: 59.86 on 16 and 369 DF, p-value: < 0.00000000000000022
## [1] "--------------------------------------------------"
## lm(formula = formula(lm_mod7), data = model_lin_train, weights = lm_mod8$w)
## [1] ""
## [1] "Shapiro test for normality: The p-value of 0.00149841137153798 is <= 0.05, so reject the null; i.e., the residuals are NOT NORMAL"
## [1] ""
## [1] "Breusch-Pagan test for homoschedasticity: The p-value of 0.726046019909312 and test statistic of 12.2578680885363 are inconclusive, so homoschedasticity can't be determined using this test. But since the p-value is > 0.05, it is reasonable to conclude that the residuals are HOMOSCHEDASTIC."
## [1] ""
## [1] "Variance inflation factor (VIF)"
## [1] "<=1: not correlated, 1-5: moderately correlated, >5: strongly correlated"
## bed4 bed3 floor4plus floor2 furnished
## 1.974297 1.763977 1.576573 1.555003 1.497120
## semifurnished basement bath2 area car2
## 1.454589 1.369392 1.321515 1.307867 1.268015
## ac floor3 guestroom neighborhood car1
## 1.266215 1.243657 1.227593 1.190301 1.183226
## mainroad
## 1.165607
## [1] ""
## [1] "Model scores:"
## [1] " adjusted R-squared: 0.71"
## [1] " AIC: -167.558"
## [1] " BIC: -96.353"
## [1] " Mallow's Cp: 158.404"
## [1] " mean squared error: 0.032"
## [1] ""
## [1] "Leverage point cutoff: 0.0932642487046632"
## [1] ""
## [1] "First 10 points of influence:"
## [1] " case #53: 0.103"
## [1] " case #68: 0.102"
## [1] " case #87: 0.097"
## [1] " case #90: 0.094"
## [1] " case #103: 0.122"
## [1] ""
##
## Call:
## lm(formula = formula(lm_mod7), data = model_lin_train[c(-53,
## -68, -87, -90, -103), ], weights = lm_mod8$w[c(-53, -68,
## -87, -90, -103)])
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -0.39955 -0.13169 -0.00021 0.12864 0.41720
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.563721647 0.037189860 391.605 < 0.0000000000000002 ***
## area 0.000051366 0.000005254 9.776 < 0.0000000000000002 ***
## mainroad 0.088784687 0.031173748 2.848 0.004648 **
## guestroom 0.065491056 0.027695872 2.365 0.018571 *
## basement 0.085828585 0.024084842 3.564 0.000415 ***
## bed3 0.053680427 0.026267071 2.044 0.041709 *
## bed4 0.073893471 0.036611812 2.018 0.044294 *
## bath2 0.142063722 0.026492819 5.362 0.0000001462959 ***
## floor2 0.084921654 0.024516679 3.464 0.000596 ***
## floor3 0.203949148 0.048755612 4.183 0.0000360781643 ***
## floor4plus 0.298337827 0.044190686 6.751 0.0000000000581 ***
## car1 0.060818177 0.024306795 2.502 0.012784 *
## car2 0.094709759 0.027112575 3.493 0.000536 ***
## semifurnished 0.141179923 0.023615878 5.978 0.0000000053860 ***
## furnished 0.143142302 0.026980739 5.305 0.0000001957569 ***
## ac 0.145829668 0.023516123 6.201 0.0000000015211 ***
## neighborhood 0.126779027 0.025075230 5.056 0.0000006799222 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1835 on 364 degrees of freedom
## Multiple R-squared: 0.7225, Adjusted R-squared: 0.7103
## F-statistic: 59.24 on 16 and 364 DF, p-value: < 0.00000000000000022
## [1] "--------------------------------------------------"
## lm(formula = formula(lm_mod7), data = model_lin_train[c(-53,
## -68, -87, -90, -103), ], weights = lm_mod8$w[c(-53, -68,
## -87, -90, -103)])
## [1] ""
## [1] "Shapiro test for normality: The p-value of 0.0016011455567499 is <= 0.05, so reject the null; i.e., the residuals are NOT NORMAL"
## [1] ""
## [1] "Breusch-Pagan test for homoschedasticity: The p-value of 0.699944941564394 and test statistic of 12.6251150042379 are inconclusive, so homoschedasticity can't be determined using this test. But since the p-value is > 0.05, it is reasonable to conclude that the residuals are HOMOSCHEDASTIC."
## [1] ""
## [1] "Variance inflation factor (VIF)"
## [1] "<=1: not correlated, 1-5: moderately correlated, >5: strongly correlated"
## bed4 bed3 floor4plus floor2 furnished
## 2.033378 1.833734 1.627263 1.559638 1.501208
## semifurnished basement bath2 area ac
## 1.458014 1.408550 1.329111 1.312546 1.277562
## floor3 car2 guestroom neighborhood car1
## 1.270650 1.255128 1.248403 1.199533 1.189140
## mainroad
## 1.170717
## [1] ""
## [1] "Model scores:"
## [1] " adjusted R-squared: 0.71"
## [1] " AIC: -164.911"
## [1] " BIC: -93.94"
## [1] " Mallow's Cp: 157.372"
## [1] " mean squared error: 0.032"
## [1] ""
## [1] "Leverage point cutoff: 0.094488188976378"
## [1] ""
## [1] "First 10 points of influence:"
## [1] " case #87: 0.099"
## [1] ""
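The leverage cutoffs printed above (0.0933 for 386 cases, 0.0945 for 381 cases) both equal 36/n, which would fit a rule such as 2(k + 1)/n with k = 17 estimated coefficients. That rule is inferred from the printed numbers, not stated in the output. A minimal sketch of flagging high-leverage cases this way, on hypothetical toy data:

```r
set.seed(7)
# Hypothetical toy data with one deliberately extreme predictor value
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$x1[1] <- 8
toy$y <- 1 + toy$x1 + toy$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = toy)

h <- hatvalues(fit)          # leverage of each case (diagonal of the hat matrix)
k <- length(coef(fit))       # number of estimated coefficients, intercept included
cutoff <- 2 * (k + 1) / n    # assumed rule; see the note above

flagged <- which(h > cutoff) # cases whose leverage exceeds the cutoff
flagged
```

Flagged cases are candidates for inspection rather than automatic removal, since a high-leverage point only distorts the fit if its response is also unusual.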
\(~\)
Perform five-fold cross-validation on the weighted model to validate our results.
## Linear Regression
##
## 386 samples
## 16 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 308, 309, 310, 309, 308
## Resampling results:
##
## RMSE Rsquared MAE
## 0.221452 0.6481765 0.169662
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
##
## Call:
## lm(formula = .outcome ~ ., data = dat, weights = wts)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -0.39873 -0.13163 0.00627 0.12850 0.41563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.576837195 0.036640378 397.835 < 0.0000000000000002 ***
## area 0.000049910 0.000005072 9.840 < 0.0000000000000002 ***
## mainroad 0.084162100 0.030798460 2.733 0.006585 **
## guestroom 0.066825026 0.027117946 2.464 0.014186 *
## basement 0.092910569 0.023534417 3.948 0.00009443484523 ***
## bed3 0.047228877 0.025581852 1.846 0.065666 .
## bed4 0.071650779 0.035800323 2.001 0.046081 *
## bath2 0.145767340 0.026146814 5.575 0.00000004789330 ***
## floor2 0.087224376 0.024366301 3.580 0.000390 ***
## floor3 0.228322844 0.045009601 5.073 0.00000062234166 ***
## floor4plus 0.304215596 0.042847379 7.100 0.00000000000645 ***
## car1 0.061013969 0.024105545 2.531 0.011785 *
## car2 0.093803011 0.027068547 3.465 0.000592 ***
## semifurnished 0.136384089 0.023452238 5.815 0.00000001311493 ***
## furnished 0.139402409 0.026699824 5.221 0.00000029770305 ***
## ac 0.142091507 0.023231442 6.116 0.00000000244205 ***
## neighborhood 0.127632227 0.024725726 5.162 0.00000040033935 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1835 on 369 degrees of freedom
## Multiple R-squared: 0.7219, Adjusted R-squared: 0.7098
## F-statistic: 59.86 on 16 and 369 DF, p-value: < 0.00000000000000022
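The resampling output above has the layout of a `caret::train()` run. As a rough illustration of what five-fold cross-validation computes, here is a base R sketch on hypothetical toy data; it is not the code that produced the numbers above, and it reports R-squared as the squared correlation between predictions and held-out values.

```r
set.seed(42)
# Hypothetical toy data standing in for the housing data
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 + 1.5 * toy$x1 - 0.8 * toy$x2 + rnorm(n, sd = 0.5)

k <- 5
# Randomly assign each row to one of k folds of (nearly) equal size
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold
cv_scores <- t(sapply(seq_len(k), function(i) {
  train <- toy[fold != i, ]
  test  <- toy[fold == i, ]
  fit   <- lm(y ~ x1 + x2, data = train)
  pred  <- predict(fit, newdata = test)
  err   <- test$y - pred
  c(RMSE = sqrt(mean(err^2)),
    Rsquared = cor(pred, test$y)^2,
    MAE = mean(abs(err)))
}))

colMeans(cv_scores)  # average the metrics over the k held-out folds
```

Because every prediction is made for rows the model did not see during fitting, these averages are usually a little less favorable than the in-sample fit statistics.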
\(~\)
Compare predicted price to actual price for the training data.
\(~\)
Fit the selected model (the formula from lm_mod7, with robust-regression weights) to the validation set.
##
## Call:
## lm(formula = formula(lm_mod7), data = dfeval_clean, weights = lm_valid1$w)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -0.35051 -0.13513 0.00779 0.11523 0.39211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.59822860 0.05808238 251.337 < 0.0000000000000002 ***
## area 0.00004290 0.00000927 4.628 0.00000825 ***
## mainroad 0.15007215 0.04437113 3.382 0.000929 ***
## guestroom 0.06257943 0.04774515 1.311 0.192076
## basement 0.09440512 0.04001255 2.359 0.019667 *
## bed3 0.06515962 0.04355248 1.496 0.136842
## bed4 0.02207513 0.05809029 0.380 0.704503
## bath2 0.18985130 0.03815093 4.976 0.00000185 ***
## floor2 0.00991900 0.03985083 0.249 0.803795
## floor3 0.15233007 0.06287172 2.423 0.016658 *
## floor4plus 0.31853494 0.08226389 3.872 0.000164 ***
## car1 0.11779078 0.04379297 2.690 0.008008 **
## car2 0.13487369 0.04527726 2.979 0.003404 **
## semifurnished 0.11957014 0.03749098 3.189 0.001755 **
## furnished 0.05721555 0.04313205 1.327 0.186797
## ac 0.18046066 0.03537770 5.101 0.00000106 ***
## neighborhood 0.08674482 0.04148445 2.091 0.038308 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1842 on 142 degrees of freedom
## Multiple R-squared: 0.7384, Adjusted R-squared: 0.7089
## F-statistic: 25.05 on 16 and 142 DF, p-value: < 0.00000000000000022
## [1] "--------------------------------------------------"
## lm(formula = formula(lm_mod7), data = dfeval_clean, weights = lm_valid1$w)
## [1] ""
## [1] "Shapiro test for normality: The p-value of 0.121694773959435 is > 0.05, so do not reject the null; i.e., the residuals are NORMAL"
## [1] ""
## [1] "Breusch-Pagan test for homoschedasticity: The p-value of 0.772092876772661 and test statistic of 11.5838980266342 are inconclusive, so homoschedasticity can't be determined using this test. But since the p-value is > 0.05, it is reasonable to conclude that the residuals are HOMOSCHEDASTIC."
## [1] ""
## [1] "Variance inflation factor (VIF)"
## [1] "<=1: not correlated, 1-5: moderately correlated, >5: strongly correlated"
## bed4 bed3 floor2 floor3 area
## 2.046080 2.028076 1.739436 1.625511 1.590490
## basement floor4plus semifurnished furnished neighborhood
## 1.546923 1.509985 1.504967 1.482755 1.436899
## car2 bath2 guestroom car1 mainroad
## 1.317953 1.271320 1.269881 1.259422 1.234909
## ac
## 1.196139
## [1] ""
## [1] "Model scores:"
## [1] " adjusted R-squared: 0.709"
## [1] " AIC: -56.184"
## [1] " BIC: -0.943"
## [1] " Mallow's Cp: 70.416"
## [1] " mean squared error: 0.03"
## [1] ""
## [1] "Leverage point cutoff: 0.226415094339623"
## [1] ""
## [1] "First 10 points of influence:"
## [1] " case #3: 0.243"
## [1] ""
\(~\)
Compare predicted price to actual price for the evaluation data.
# Compare predicted price to actual price (evaluation data)
# exp() back-transforms the predictions from the log scale to prices.
# predict.lm() takes `newdata`, not `data`; only the point prediction (column 1) is kept.
dfeval_clean$pred_price <- exp(predict(lm_valid2, newdata=dfeval_clean, weights=lm_valid1$w, interval='prediction')[,1])
dfeval_clean %>% ggplot(mapping=aes(x=price, y=pred_price)) +
  geom_point() +
  geom_smooth(method='lm', se=TRUE) +
  xlab('Price') + ylab('Predicted Price') +
  ggtitle('Figure 16. Predicted Price vs Price (Validation Data)') +
  theme_ipsum()
\(~\)
Our model comparison is below:

| # | Train/Validation | Linear/Robust | Full/Step-reduced | Log Transform | Outliers Removed | Huber-Weighted | Adj R-Sqr |
|---|---|---|---|---|---|---|---|
| 1 | Train | Linear | Full | | | | 0.675 |
| 2 | Train | Linear | Step | | | | 0.677 |
| 3 | Train | Linear | Step | | | | 0.670 |
| 4 | Train | Linear | Full | Yes | | | 0.691 |
| 5 | Train | Linear | Step | Yes | | | 0.691 |
| 6 | Train | Linear | Step | Yes | Yes | | 0.684 |
| 7 | Train | Linear | Step | Yes | Yes | | 0.684 |
| 8 | Train | Robust | Step | Yes | | | NA |
| 9 | Train | Linear | Step | Yes | | Yes | 0.710 |
| 10 | Train | Linear | Step | Yes | Yes | Yes | 0.710 |
| 11 | Validation | Linear | Step | Yes | | Yes | 0.709 |
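The adjusted R-squared values in the comparison can be checked against the multiple R-squared, sample size and predictor count printed in the model summaries. A small sketch using the standard formula; the helper name `adj_r2` is hypothetical, and the inputs are the rounded values from the summaries above.

```r
# Adjusted R-squared from multiple R-squared, n observations and p predictors
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values read from the summaries above; each model has p = 16 predictors
train_adj <- adj_r2(0.7219, n = 386, p = 16)  # weighted fit on the training data
valid_adj <- adj_r2(0.7384, n = 159, p = 16)  # weighted fit on the validation data

round(c(train = train_adj, validation = valid_adj), 3)
```

This gives about 0.710 for the training fit and 0.709 for the validation fit, matching the last two linear models in the table. The penalty is larger for the validation fit because the same 16 predictors are estimated from far fewer observations.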